Computer-aided probability base calling for arrays of nucleic acid probes on chips

ABSTRACT

A computer system for analyzing nucleic acid sequences is provided. The computer system is used to calculate probabilities for determining unknown bases by analyzing the fluorescence intensities of hybridized nucleic acid probes on biological chips. Additionally, information from multiple experiments is utilized to improve the accuracy of calling unknown bases.

GOVERNMENT RIGHTS NOTICE

[0001] Portions of the material in this specification arose under the cooperative agreement 70NANB5H1031 between Affymetrix, Inc. and the Department of Commerce through the National Institute of Standards and Technology.

COPYRIGHT NOTICE

[0002] A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the xeroxographic reproduction by anyone of the patent document or the patent disclosure in exactly the form it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

SOFTWARE APPENDIX

[0003] A Software Appendix comprising twenty-one (21) sheets is included herewith.

BACKGROUND OF THE INVENTION

[0004] The present invention relates to the field of computer systems. More specifically, the present invention relates to computer systems for evaluating and comparing biological sequences.

[0005] Devices and computer systems for forming and using arrays of materials on a substrate are known. For example, PCT application WO92/10588, incorporated herein by reference for all purposes, describes techniques for sequencing or sequence checking nucleic acids and other materials. Arrays for performing these operations may be formed according to the methods of, for example, the pioneering techniques disclosed in U.S. Pat. No. 5,143,854 and U.S. patent application Ser. No. 08/249,188, both incorporated herein by reference for all purposes.

[0006] According to one aspect of the techniques described therein, an array of nucleic acid probes is fabricated at known locations on a chip or substrate. A fluorescently labeled nucleic acid is then brought into contact with the chip and a scanner generates an image file (also called a cell file) indicating the locations where the labeled nucleic acids bound to the chip. Based upon the image file and identities of the probes at specific locations, it becomes possible to extract information such as the monomer sequence of DNA or RNA. Such systems have been used to form, for example, arrays of DNA that may be used to study and detect mutations relevant to cystic fibrosis, the P53 gene (relevant to certain cancers), HIV, and other genetic characteristics.

[0007] Innovative computer-aided techniques for base calling are disclosed in U.S. patent application Ser. No. 08/327,525, which is incorporated by reference for all purposes. However, improved computer systems and methods are still needed to evaluate, analyze, and process the vast amount of information now used and made available by these pioneering technologies.

SUMMARY OF THE INVENTION

[0008] An improved computer-aided system for calling unknown bases in sample nucleic acid sequences from multiple nucleic acid probe intensities is disclosed. The present invention is able to call bases with extremely high accuracy (up to 98.5%). At the same time, confidence information may be provided that indicates the likelihood that the base has been called correctly. The methods of the present invention are robust and uniformly optimal regardless of the experimental conditions.

[0009] According to one aspect of the invention, a computer system is used to identify an unknown base in a sample nucleic acid sequence by the steps of: inputting a plurality of hybridization probe intensities, each of the probe intensities corresponding to a nucleic acid probe; for each of the plurality of probe intensities, determining a probability that the corresponding nucleic acid probe best hybridizes with the sample nucleic acid sequence; and calling the unknown base according to the nucleic acid probe with the highest associated probability.

[0010] According to another aspect of the invention, an unknown base in a sample nucleic acid sequence is called by a base call with the highest probability of correctly calling the unknown base. The unknown base in the sample nucleic acid sequence is identified by the steps of: inputting multiple base calls for the unknown base, each of the base calls having an associated probability which represents a confidence that the unknown base is called correctly; selecting a base call that has a highest associated probability; and calling the unknown base according to the selected base call. The multiple base calls are typically produced from multiple experiments. The multiple experiments may be performed on the same chip utilizing different parameters (e.g., nucleic acid probe length).
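The selection step described in this aspect can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of the Software Appendix: the function name `call_by_max_probability`, the `(base, probability)` tuple layout, and the sample values are all assumptions made for the example.

```python
def call_by_max_probability(base_calls):
    """Pick the base call with the highest associated probability.

    base_calls is a list of (base, probability) pairs, one per experiment.
    The pair layout is an assumption made for this sketch.
    """
    if not base_calls:
        raise ValueError("at least one base call is required")
    return max(base_calls, key=lambda call: call[1])


# Hypothetical calls for one unknown base from three experiments.
calls = [("A", 0.82), ("A", 0.91), ("C", 0.55)]
best_base, best_prob = call_by_max_probability(calls)
```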

[0011] According to yet another aspect of the invention, an unknown base in a sample nucleic acid sequence is called according to multiple base calls that collectively have the highest probability of correctly calling the unknown base. The unknown base in the sample nucleic acid sequence is identified by the steps of: inputting multiple probabilities for each possible base for the unknown base, each of the probabilities representing a probability that the unknown base is an associated base; producing a product of probabilities for each possible base, each product being associated with a possible base; and calling the unknown base according to a base associated with a highest product. The multiple base calls are typically produced from multiple experiments. The multiple experiments may be performed on the same chip utilizing different parameters (e.g., nucleic acid probe length).
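As a rough illustration of the product step in this aspect, the following Python sketch multiplies the per-experiment probabilities for each possible base and calls the base with the highest product. The function name `call_by_product`, the dictionary layout, and the numbers are hypothetical and chosen only for the example.

```python
import math

BASES = "ACGT"


def call_by_product(experiments):
    """Multiply per-experiment probabilities for each base; call the largest.

    experiments is a list of dicts mapping each base to the probability that
    the unknown base is that base (layout assumed for this sketch).
    """
    products = {b: math.prod(exp[b] for exp in experiments) for b in BASES}
    return max(products, key=products.get), products


# Hypothetical probabilities from two experiments on the same unknown base.
experiments = [
    {"A": 0.60, "C": 0.30, "G": 0.05, "T": 0.05},
    {"A": 0.40, "C": 0.50, "G": 0.05, "T": 0.05},
]
called_base, products = call_by_product(experiments)
```

In this made-up case the second experiment alone favors C, yet the products (0.24 for A against 0.15 for C) favor A, which is the point of combining the evidence rather than trusting a single experiment.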

[0012] According to another aspect of the invention, both strands of a DNA molecule are analyzed to increase the accuracy of identifying an unknown base in a sample nucleic acid sequence by the steps of: inputting a first base call for the unknown base, the first base call determined from a first nucleic acid probe that is equivalent to a portion of the sample nucleic acid sequence including the unknown base; inputting a second base call for the unknown base, the second base call determined from a second nucleic acid probe that is complementary to a portion of the sample nucleic acid sequence including the unknown base; selecting one of the first or second nucleic acid probes that has a base at an interrogation position which has a high probability of producing correct base calls; and calling the unknown base according to the selected one of the first or second nucleic acid probes.

[0013] A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] FIG. 1 illustrates an example of a computer system used to execute the software of the present invention;

[0015] FIG. 2 shows a system block diagram of a typical computer system used to execute the software of the present invention;

[0016] FIG. 3 illustrates an overall system for forming and analyzing arrays of biological materials such as DNA or RNA;

[0017] FIG. 4 is an illustration of the software for the overall system;

[0018] FIG. 5 illustrates the global layout of a chip formed in the overall system;

[0019] FIG. 6 illustrates conceptually the binding of probes on chips;

[0020] FIG. 7 illustrates probes arranged in lanes on a chip;

[0021] FIG. 8 illustrates a hybridization pattern of a target on a chip with a reference sequence as in FIG. 7;

[0022] FIG. 9 illustrates the high level flow of the probability base calling method;

[0023] FIG. 10 illustrates the flow of the maximum probability method;

[0024] FIG. 11 illustrates the flow of the product of probabilities method; and

[0025] FIG. 12 illustrates the flow of the wild-type base preference method.

DESCRIPTION OF THE PREFERRED EMBODIMENT

CONTENTS

[0026] I. General

[0027] II. Probability Base Calling Method

[0028] III. Maximum Probability Method

[0029] IV. Product of Probabilities Method

[0030] V. Wild-Type Base Preference Method

[0031] VI. Software Appendix

[0032] I. General

[0033] In the description that follows, the present invention will be described in reference to a Sun Workstation in a UNIX environment. The present invention, however, is not limited to any particular hardware or operating system environment. Instead, those skilled in the art will find that the systems and methods of the present invention may be advantageously applied to a variety of systems, including IBM personal computers running MS-DOS or Microsoft Windows. Therefore, the following description of specific systems is for purposes of illustration and not limitation.

[0034] FIG. 1 illustrates an example of a computer system used to execute the software of the present invention. FIG. 1 shows a computer system 1 which includes a monitor 3, screen 5, cabinet 7, keyboard 9, and mouse 11. Mouse 11 may have one or more buttons such as mouse buttons 13. Cabinet 7 houses a floppy disk drive 14 and a hard drive (not shown) that may be utilized to store and retrieve software programs incorporating the present invention. Although a floppy disk 15 is shown as the removable media, other removable tangible media including CD-ROM and tape may be utilized. Cabinet 7 also houses familiar computer components (not shown) such as a processor, memory, and the like.

[0035] FIG. 2 shows a system block diagram of computer system 1 used to execute the software of the present invention. As in FIG. 1, computer system 1 includes monitor 3 and keyboard 9. Computer system 1 further includes subsystems such as a central processor 52, system memory 54, I/O controller 56, display adapter 58, serial port 62, disk 64, network interface 66, and speaker 68. Other computer systems suitable for use with the present invention may include additional or fewer subsystems. For example, another computer system could include more than one processor 52 (i.e., a multi-processor system) or memory cache.

[0036] Arrows such as 70 represent the system bus architecture of computer system 1. However, these arrows are illustrative of any interconnection scheme serving to link the subsystems. For example, speaker 68 could be connected to the other subsystems through a port or have an internal direct connection to central processor 52. Computer system 1 shown in FIG. 2 is but an example of a computer system suitable for use with the present invention. Other configurations of subsystems suitable for use with the present invention will be readily apparent to one of ordinary skill in the art.

[0037] The VLSIPS™ technology provides methods of making very large arrays of oligonucleotide probes on very small chips. See U.S. Pat. No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, each of which is incorporated by reference for all purposes. The oligonucleotide probes on the “DNA chip” are used to detect complementary nucleic acid sequences in a sample nucleic acid of interest (the “target” nucleic acid).

[0038] The present invention provides methods of analyzing hybridization intensity files for a chip containing hybridized nucleic acid probes. In a representative embodiment, the files represent fluorescence data from a biological array, but the files may also represent other data such as radioactive intensity data. Therefore, the present invention is not limited to analyzing fluorescent measurements of hybridizations but may be readily utilized to analyze other measurements of hybridization.

[0039] For purposes of illustration, the present invention is described as being part of a computer system that designs a chip mask, synthesizes the probes on the chip, labels the nucleic acids, and scans the hybridized nucleic acid probes. Such a system is fully described in U.S. patent application Ser. No. 08/249,188 which has been incorporated by reference for all purposes. However, the present invention may be used separately from the overall system for analyzing data generated by such systems.

[0040] FIG. 3 illustrates a computerized system for forming and analyzing arrays of biological materials such as RNA or DNA. A computer 100 is used to design arrays of biological polymers such as RNA or DNA. The computer 100 may be, for example, an appropriately programmed Sun Workstation or personal computer or workstation, such as an IBM PC equivalent, including appropriate memory and a CPU as shown in FIGS. 1 and 2. The computer system 100 obtains inputs from a user regarding characteristics of a gene of interest, and other inputs regarding the desired features of the array. Optionally, the computer system may obtain information regarding a specific genetic sequence of interest from an external or internal database 102 such as GenBank. The output of the computer system 100 is a set of chip design computer files 104 in the form of, for example, a switch matrix, as described in PCT application WO 92/10092, and other associated computer files.

[0041] The chip design files are provided to a system 106 that designs the lithographic masks used in the fabrication of arrays of molecules such as DNA. The system or process 106 may include the hardware necessary to manufacture masks 110 and also the computer hardware and software 108 necessary to lay the mask patterns out on the mask in an efficient manner. As with the other features in FIG. 3, such equipment may or may not be located at the same physical site, but is shown together for ease of illustration in FIG. 3. The system 106 generates masks 110 or other synthesis patterns such as chrome-on-glass masks for use in the fabrication of polymer arrays.

[0042] The masks 110, as well as selected information relating to the design of the chips from system 100, are used in a synthesis system 112. Synthesis system 112 includes the necessary hardware and software used to fabricate arrays of polymers on a substrate or chip 114. For example, synthesizer 112 includes a light source 116 and a chemical flow cell 118 on which the substrate or chip 114 is placed. Mask 110 is placed between the light source and the substrate/chip, and the two are translated relative to each other at appropriate times for deprotection of selected regions of the chip. Selected chemical reagents are directed through flow cell 118 for coupling to deprotected regions, as well as for washing and other operations. All operations are preferably directed by an appropriately programmed computer 119, which may or may not be the same computer as the computer(s) used in mask design and mask making.

[0043] The substrates fabricated by synthesis system 112 are optionally diced into smaller chips and exposed to marked receptors. The receptors may or may not be complementary to one or more of the molecules on the substrate. The receptors are marked with a label such as a fluorescein label (indicated by an asterisk in FIG. 3) and placed in scanning system 120. Scanning system 120 again operates under the direction of an appropriately programmed digital computer 122, which also may or may not be the same computer as the computers used in synthesis, mask making, and mask design. The scanner 120 includes a detection device 124 such as a confocal microscope or CCD (charge-coupled device) that is used to detect the location where labeled receptor (*) has bound to the substrate. The output of scanner 120 is an image file(s) 124 indicating, in the case of fluorescein labeled receptor, the fluorescence intensity (photon counts or other related measurements, such as voltage) as a function of position on the substrate. Since higher photon counts will be observed where the labeled receptor has bound more strongly to the array of polymers, and since the monomer sequence of the polymers on the substrate is known as a function of position, it becomes possible to determine the sequence(s) of polymer(s) on the substrate that are complementary to the receptor.

[0044] The image file 124 is provided as input to an analysis system 126 that incorporates the visualization and analysis methods of the present invention. Again, the analysis system may be any one of a wide variety of computer system(s), but in a preferred embodiment the analysis system is based on a Sun Workstation or equivalent. The present invention provides various methods of analyzing the chip design files and the image files, providing appropriate output 128. The present invention may further be used to identify specific mutations in a receptor such as DNA or RNA.

[0045] FIG. 4 provides a simplified illustration of the overall software system used in the operation of one embodiment of the invention. As shown in FIG. 4, the system first identifies the genetic sequence(s) or targets that would be of interest in a particular analysis at step 202. The sequences of interest may, for example, be normal or mutant portions of a gene, genes that identify heredity, or provide forensic information. Sequence selection may be provided via manual input of text files or may be from external sources such as GenBank. At step 204 the system evaluates the gene to determine or assist the user in determining which probes would be desirable on the chip, and provides an appropriate “layout” on the chip for the probes. The chip usually includes probes that are complementary to a reference nucleic acid sequence which has a known sequence. A wild-type probe is a probe that will ideally hybridize with the reference sequence and thus a wild-type gene (also called the chip wild-type) would ideally hybridize with wild-type probes on the chip. The target sequence is substantially similar to the reference sequence except for the presence of mutations, insertions, deletions, and the like. The layout implements desired characteristics such as arrangement on the chip that permits “reading” of genetic sequence and/or minimization of edge effects, ease of synthesis, and the like.

[0046] FIG. 5 illustrates the global layout of a chip. Chip 114 is composed of multiple units where each unit may contain different tilings for the chip wild-type sequence. Unit 1 is shown in greater detail and shows that each unit is composed of multiple cells which are areas on the chip that may contain probes. Conceptually, each unit is composed of multiple sets of related cells. As used herein, the term cell refers to a region on a substrate that contains many copies of a molecule or molecules of interest. Each unit is composed of multiple cells that may be placed in rows (or “lanes”) and columns. In one embodiment, a set of five related cells includes the following: a wild-type cell 220, “mutation” cells 222, and a “blank” cell 224. Cell 220 contains a wild-type probe that is the complement of a portion of the wild-type sequence. Cells 222 contain “mutation” probes for the wild-type sequence. For example, if the wild-type probe is 3′-ACGT, the probes 3′-ACAT, 3′-ACCT, 3′-ACGT, and 3′-ACTT may be the “mutation” probes. Cell 224 is the “blank” cell because it contains no probes (also called the “blank” probe). As the blank cell contains no probes, labeled receptors should not bind to the chip in this area. Thus, the blank cell provides an area that can be used to measure the background intensity.

[0047] Again referring to FIG. 4, at step 206 the masks for the synthesis are designed. At step 208 the software utilizes the mask design and layout information to make the DNA or other polymer chips. This software 208 will control, among other things, relative translation of a substrate and the mask, the flow of desired reagents through a flow cell, the synthesis temperature of the flow cell, and other parameters. At step 210, another piece of software is used in scanning a chip thus synthesized and exposed to a labeled receptor. The software controls the scanning of the chip, and stores the data thus obtained in a file that may later be utilized to extract sequence information.

[0048] At step 212 a computer system according to the present invention utilizes the layout information and the fluorescence information to evaluate the hybridized nucleic acid probes on the chip. Among the important pieces of information obtained from DNA chips are the identification of mutant receptors and determination of genetic sequence of a particular receptor.

[0049] FIG. 6 illustrates the binding of a particular target DNA to an array of DNA probes 114. As shown in this simple example, the following probes are formed in the array (only one probe is shown for the wild-type probe):

3′-AGAACGT
   AGACCGT
   AGAGCGT
   AGATCGT
      .
      .
      .

[0050] As shown, the set of probes differ by only one base so the probes are designed to determine the identity of the base at that location in the nucleic acid sequence.

[0051] When a fluorescein-labeled (or other marked) target with the sequence 5′-TCTTGCA is exposed to the array, it is complementary only to the probe 3′-AGAACGT, and fluorescein will be primarily found on the surface of the chip where 3′-AGAACGT is located. Thus, for each set of probes that differ by only one base, the image file will contain four fluorescence intensities, one for each probe. Each fluorescence intensity can therefore be associated with the base of each probe that is different from the other probes. Additionally, the image file will contain a “blank” cell which can be used as the fluorescence intensity of the background. By analyzing the five fluorescence intensities associated with a specific base location, it becomes possible to extract sequence information from such arrays using the methods of the invention disclosed herein.

[0052] FIG. 7 illustrates probes arranged in lanes on a chip. A reference sequence is shown with five interrogation positions marked with number subscripts. An interrogation position is a base position in the reference sequence where the target sequence may contain a mutation or otherwise differ from the reference sequence. The chip may contain five probe cells that correspond to each interrogation position. Each probe cell contains a set of probes that have a common base at the interrogation position. For example, at the first interrogation position, I₁, the reference sequence has a base T. The wild-type probe for this interrogation position is 3′-TGAC where the base A in the probe is complementary to the base at the interrogation position in the reference sequence.

[0053] Similarly, there are four “mutant” probe cells for the first interrogation position, I₁. The four mutant probes are 3′-TGAC, 3′-TGCC, 3′-TGGC, and 3′-TGTC. The four mutant probes differ from one another by a single base at the interrogation position. As shown, the wild-type and mutant probes are arranged in lanes on the chip. One of the mutant probes (in this case 3′-TGAC) is identical to the wild-type probe and therefore does not evidence a mutation. However, the redundancy gives a visual indication of mutations as will be seen in FIG. 8.

[0054] Still referring to FIG. 7, the chip contains wild-type and mutant probes for each of the other interrogation positions I₂-I₅. In each case, the wild-type probe is equivalent to one of the mutant probes.

[0055] FIG. 8 illustrates a hybridization pattern of a target on a chip with a reference sequence as in FIG. 7. The reference sequence is shown along the top of the chip for comparison. The chip includes a WT-lane (wild-type), an A-lane, a C-lane, a G-lane, and a T-lane (or U). Each lane is a row of cells containing probes. The cells in the WT-lane contain probes that are complementary to the reference sequence. The cells in the A-, C-, G-, and T-lanes contain probes that are complementary to the reference sequence except that the named base is at the interrogation position.

[0056] In one embodiment, the hybridization of probes in a cell is determined by the fluorescent intensity (e.g., photon counts) of the cell resulting from the binding of marked target sequences. The fluorescent intensity may vary greatly among cells. For simplicity, FIG. 8 shows a high degree of hybridization by a cell containing a darkened area. The WT-lane allows a simple visual indication that there is a mutation at interrogation position I₄ because the wild-type cell is not dark at that position. The cell in the C-lane is darkened which indicates that the mutation is from T->G (mutant probe cells are complementary so the C-cell indicates a G mutation).

[0057] In practice, the fluorescent intensities of cells near an interrogation position having a mutation are relatively dark creating “dark regions” around a mutation. The lower fluorescent intensities result because the cells at interrogation positions near a mutation do not contain probes that are perfectly complementary to the target sequence; thus, the hybridization of these probes with the target sequence is lower. For example, the relative intensity of the cells at interrogation positions I₃ and I₅ may be relatively low because none of the probes therein are complementary to the target sequence. Although the lower fluorescent intensities reduce the resolution of the data, the methods of the present invention provide highly accurate base calling within the dark regions around a mutation and are able to identify other mutations within these regions.

[0058] The present invention calls bases by assigning the bases the following codes:

Code  Group              Meaning
A     A                  Adenine
C     C                  Cytosine
G     G                  Guanine
T     T(U)               Thymine (Uracil)
M     A or C             aMino
R     A or G             puRine
W     A or T(U)          Weak interaction (2 H bonds)
Y     C or T(U)          pYrimidine
S     C or G             Strong interaction (3 H bonds)
K     G or T(U)          Keto
V     A, C or G          not T(U)
H     A, C or T(U)       not G
D     A, G or T(U)       not C
B     C, G or T(U)       not A
N     A, C, G, or T(U)   Insufficient intensity to call
X     A, C, G, or T(U)   Insufficient discrimination to call

[0059] Most of the codes conform to the IUPAC standard. However, code Nhas been redefined and code X has been added.
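For illustration only, the code table can be held as a lookup from a set of candidate bases to its one-letter code, as in the Python sketch below. The names `AMBIGUITY_CODES` and `code_for` are hypothetical. The codes N and X are deliberately left out of the lookup, since both cover all four bases and are distinguished by the reason the call failed rather than by the set of bases.

```python
# Map each set of candidate bases to its one-letter code (T stands for T(U)).
AMBIGUITY_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H",
    frozenset("AGT"): "D", frozenset("CGT"): "B",
}


def code_for(bases):
    """Return the code for an iterable of candidate bases (order is ignored)."""
    return AMBIGUITY_CODES[frozenset(bases)]
```

Keying on a frozenset means that `code_for("CA")` and `code_for("AC")` give the same code, so callers do not need to sort the candidate bases first.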

[0060] II. Probability Base Calling Method

[0061] The probability base calling method is a method of calling bases in a sample nucleic acid sequence which provides extremely high accuracy. At the same time, confidence information is provided that indicates the likelihood that the base has been called correctly. The probability base calling method is robust and uniformly optimal regardless of the experimental conditions.

[0062] For simplicity, the probability base calling method will be described as being used to identify one unknown base in a sample nucleic acid sequence. In practice, the method is typically used to identify many or all the bases in a nucleic acid sequence or sequences.

[0063] In a preferred embodiment, the unknown base will be identified by evaluation of up to four mutation probes. For example, suppose a gene of interest has the DNA sequence of 5′-AGAA[C]CTGC-3′ with a possible mutation at the bracketed base position. Suppose that 5-mer probes are to be synthesized for the chip. A representative wild-type probe of 3′-TTGGA is complementary to the region of the sequence around the possible mutation. The “mutation” probes will be the same as the wild-type probe except for a different base at the third position as follows: 3′-TTAGA, 3′-TTCGA, 3′-TTGGA, and 3′-TTTGA.
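The construction of the probe set in this example can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names `wild_type_probe` and `mutation_probes` are hypothetical, the region is taken to be the five bases 5′-AACCT that pair with the probe 3′-TTGGA, and positions are counted from zero.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def wild_type_probe(region):
    """Complement a region written 5'->3'; the probe is returned written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in region)


def mutation_probes(region, position):
    """Return the four probes that differ only at the given probe position.

    position is a zero-based index, so the third base of a 5-mer is position 2.
    """
    wild_type = wild_type_probe(region)
    return [wild_type[:position] + base + wild_type[position + 1:]
            for base in "ACGT"]


probes = mutation_probes("AACCT", position=2)
```

One of the four probes returned is always identical to the wild-type probe, which matches the redundancy between the wild-type and mutation probes noted earlier in the description.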

[0064] If the fluorescently marked sample sequence is exposed to the above four mutation probes, the intensity should be highest for the probe that binds most strongly to the sample sequence. Therefore, if the probe 3′-TTTGA shows the highest intensity, the unknown base in the sample will generally be called an A mutation because the probes are complementary to the sample sequence. Although calling bases according to the highest intensity probe is satisfactory in some instances, the accuracy may be affected by many experimental conditions.
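The simple highest-intensity call described here, which the probability method is meant to improve upon, might look like the sketch below. The function name `call_by_highest_intensity` and the intensity values are invented for illustration; only the probe sequences come from the example above.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_by_highest_intensity(intensities, position):
    """Call the sample base from the brightest probe.

    intensities maps a probe sequence to its measured intensity, and position
    is the zero-based index of the interrogation base within each probe.
    The probes are complementary to the sample, so the call is the complement
    of the brightest probe's base at that position.
    """
    brightest = max(intensities, key=intensities.get)
    return COMPLEMENT[brightest[position]]


# Hypothetical intensities in which the probe 3'-TTTGA is the brightest.
readings = {"TTAGA": 120.0, "TTCGA": 95.0, "TTGGA": 150.0, "TTTGA": 900.0}
called = call_by_highest_intensity(readings, position=2)
```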

[0065] FIG. 9 shows the high level flow of the probability base calling method. At step 302, the system retrieves the intensities for probes at an interrogation position. Although not necessary, the background intensity (e.g., from the blank cell) may be subtracted from each of the observed intensities. If a DNA sequence is being called, the system may loop through the flowchart for each base position to be called in the sequence. For simplicity, FIG. 9 shows the method of calling a base at a single interrogation position in the sample sequence.

[0066] As discussed earlier, each cell on the chip defines an area which contains a set of identical probes. After the chip has been exposed to a fluorescein-labeled (or other marked) sample sequence, intensity readings are taken. Intensity readings are taken over the surface of the cell resulting in multiple intensity readings for each cell. The system calculates the mean and standard deviation for the intensities measured for each cell at step 304. As each cell is associated with a probe type, the term “probe intensity” will generally refer to the mean of intensities associated with the probe. Although the mean is utilized in the preferred embodiment, other statistical analysis could be used including an average.
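A minimal sketch of the per-cell statistics of step 304, with the optional background subtraction of step 302 folded in, is given below. The function name `summarize_cell` and the readings are hypothetical. The use of the sample standard deviation (`statistics.stdev`) is an assumption of this sketch, as the text does not say which estimator is used.

```python
import statistics


def summarize_cell(readings, background=0.0):
    """Return (probe intensity, standard deviation) for one cell.

    The probe intensity is the mean of the readings minus an optional
    background level (e.g., the mean of the blank cell). Subtracting a
    constant does not change the spread, so the standard deviation is
    computed on the raw readings.
    """
    probe_intensity = statistics.mean(readings) - background
    spread = statistics.stdev(readings)  # sample estimator: an assumption
    return probe_intensity, spread


# Hypothetical readings taken over the surface of a single cell.
intensity, spread = summarize_cell([10, 12, 14], background=2.0)
```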

[0067] At step 306, the system calculates the probability that each base (e.g., A, C, G, or T(U)) is at the interrogation position. If we assume that the base associated with the probe having the highest probe intensity (i.e., best hybridizes with the sample sequence) is the correct call, the probability that the unknown base is a certain base is equal to the following:

$$\mathrm{Prob}(X) = \mathrm{Prob}\left(I_{X} > \max_{Y \neq X} I_{Y}\right) \qquad (1)$$

[0068] where X and Y are A, C, G, or T(U); I is the probe intensity associated with the subscripted base; and max represents the maximum of the probe intensities. Thus, the probability that a base at an interrogation position is base A is the probability that the probe intensity associated with base A (i.e., has A's complement T) is greater than the highest probe intensity associated with C, G, and T.

[0069] The probability that the probe intensity associated with base A is greater than the highest probe intensity associated with the other bases is approximated by the following:

$$\mathrm{Prob}\left(I_{X} > \max_{Y \neq X} I_{Y}\right) \approx \frac{\prod_{Y \neq X} \mathrm{Prob}\left(I_{X} > I_{Y}\right)}{\sum_{Z = A,C,G,T} \prod_{Y \neq Z} \mathrm{Prob}\left(I_{Z} > I_{Y}\right)} \qquad (2)$$

[0070] where X, Y and Z are A, C, G, or T(U) and Π represents the product of the probabilities that I_X is greater than each of the other possible bases. Thus, the probability that a base at an interrogation position is base A is proportional to the product of the probabilities that the probe intensity associated with base A is greater than the probe intensities associated with C, G, and T. In a preferred embodiment, the system normalizes the probabilities so that the sum of the probabilities equals 1. As shown above, the system accomplishes this by dividing each probability by the sum of the probabilities associated with the different bases.

[0071] According to the present invention, the probability that a probe intensity associated with a base is greater than the probe intensity associated with another base is as follows:

$$\mathrm{Prob}\left(I_{X} > I_{Y}\right) = \Phi\left(\frac{I_{X} - I_{Y}}{\sqrt{\sigma_{X}^{2} + \sigma_{Y}^{2}}}\right), \quad Y \neq X \qquad (3)$$

[0072] where X and Y are A, C, G, or T(U) and σ represents the standard deviation (σ² being the variance) of the intensities measured for the cell associated with the subscripted base. The Φ function is as follows:

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^{2}}{2}}\, dy \qquad (4)$$

[0073] which is the cumulative distribution function of the standard normal distribution and may be determined by any number of methods known to those skilled in the art.

[0074] Utilizing these equations, the system calculates the probability that the base at the interrogation position is A, the probability that the base at the interrogation position is C, the probability that the base at the interrogation position is G, and the probability that the base at the interrogation position is T(U). In a preferred embodiment, probabilities are normalized so that the sum of these probabilities equals 1.
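The calculation of step 306 can be illustrated with the short Python sketch below. It is a sketch under stated assumptions rather than the code of the Software Appendix: the names `phi`, `pairwise` and `base_probabilities` are hypothetical, each cell is summarized as a (mean, standard deviation) pair keyed by the base it is associated with, and the two cell variances are summed under the square root, as the variance of a difference of independent readings requires. Φ is evaluated through the error function, one of the many available methods.

```python
import math

BASES = "ACGT"


def phi(x):
    """Standard normal cumulative distribution, evaluated via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pairwise(mean_x, std_x, mean_y, std_y):
    """Probability that probe intensity X exceeds probe intensity Y."""
    spread = math.sqrt(std_x ** 2 + std_y ** 2)
    if spread == 0.0:  # degenerate case with no measured variation
        return 1.0 if mean_x > mean_y else (0.5 if mean_x == mean_y else 0.0)
    return phi((mean_x - mean_y) / spread)


def base_probabilities(cells):
    """Normalized product of pairwise probabilities for each base.

    cells maps each base to the (mean, std) of its associated cell.
    """
    raw = {}
    for x in BASES:
        product = 1.0
        for y in BASES:
            if y != x:
                product *= pairwise(*cells[x], *cells[y])
        raw[x] = product
    total = sum(raw.values())
    return {base: raw[base] / total for base in BASES}


# Hypothetical cell statistics: the cell associated with A is much brighter.
cells = {"A": (100.0, 10.0), "C": (40.0, 10.0),
         "G": (35.0, 10.0), "T": (30.0, 10.0)}
probs = base_probabilities(cells)
```

With these made-up statistics, nearly all of the normalized probability falls on A, because the cell associated with A is several standard deviations brighter than the others.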

[0075] At step 308, the system determines if the highest probability associated with a base is greater than a probability threshold. In one embodiment, the value for the probability threshold is 0.8 (for probabilities that have been normalized). The probability threshold is a user-defined value that a probability should exceed before the base is called. If the probabilities are normalized so that their sum is equal to 1, the probability threshold will be in the range of 0.25 to 1.0. The use of a probability threshold is not necessary but allows the user to select the confidence of the resulting base calls. It should be noted that a probability threshold of 0.25 corresponds to calling the base associated with the highest probability (i.e., no threshold), since the highest of four normalized probabilities is never less than 0.25.

[0076] If the highest probability is greater than the probability threshold, the system calls the base as the base associated with the highest probability at step 310. Thus, if the probes in the G-lane cell had the highest probability and the probability is greater than the probability threshold, the system would call the base at the interrogation position a C since the probes are complementary to the sample sequence. At step 312, the confidence (i.e., the likelihood that the base is called correctly) is set equal to the highest probability.

[0077] Otherwise, at step 314, the system creates a sum of probabilities by adding the highest probability to the next highest probability. The sum represents the probability that the base is either of the bases associated with the two highest probabilities. The system then determines if the sum is greater than the probability threshold at step 316. If the sum is greater than the probability threshold, the system calls the base as an ambiguity code representing the bases that are associated with the two highest probabilities. Thus, if the probabilities associated with bases A and C are the two highest probabilities and their sum is greater than the probability threshold, the system would call the base at the interrogation position an M (meaning A or C). Since the probes are complementary to the sample sequence, the probabilities associated with bases A and C are the probabilities of the probes in the T- and G-lane cells, respectively. At step 320, the confidence is set equal to the sum of the probabilities, which exceeds the probability threshold.

[0078] If the sum is not greater than the probability threshold, the system adds the next highest probability to the sum of probabilities at step 314 and the sum is compared to the probability threshold at step 316. When the sum is greater than the probability threshold, the system calls the base as an ambiguity code representing all the bases that are associated with probabilities included in the sum. As before, the confidence is set equal to the sum of the probabilities, which exceeds the probability threshold.
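By way of illustration only, the thresholding of steps 308 through 320 may be sketched in Python as follows. The function name `call_base` and the input layout are hypothetical. The sketch assumes that the probabilities are already normalized and keyed by the base of the sample sequence (i.e., already complemented from the chip lane), and it uses the standard IUPAC nucleotide ambiguity codes. The behavior when the threshold is never exceeded is an assumption, since the text above does not address that case.

```python
# IUPAC codes keyed by the sorted set of bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def call_base(probs, threshold=0.8):
    """Return (code, confidence) for one interrogation position.

    `probs` maps each sample base to its normalized probability.
    Bases are added in order of decreasing probability until the
    running sum is greater than `threshold`.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    chosen = []
    total = 0.0
    for base, p in ranked:
        chosen.append(base)
        total += p
        if total > threshold:
            break
    # Assumption: if the threshold is never exceeded, all four bases
    # are included and the call is N.
    return IUPAC["".join(sorted(chosen))], total

# Probabilities for interrogation position 6 of the worked example.
code, confidence = call_base({"A": 0.42, "C": 0.25, "G": 0.22, "T": 0.11}, 0.5)
print(code, round(confidence, 2))
```

With a threshold of 0.5, the highest probability (0.42 for A) is not sufficient on its own, so the next highest (0.25 for C) is added and the position is called M with a confidence of 0.67.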

[0079] As an example of the probability base calling method, suppose a known nucleic acid sequence 5′-ACTGTAGGG is to be called. After the sequence is labeled and exposed to a DNA chip, an image file is generated that has the fluorescent intensities (e.g., photon counts) associated with each cell on the chip. The mean and standard deviation are calculated and are as follows for each interrogation position:

IntPos   Mean A   Mean C   Mean G   Mean T
1        176.8    65.9     73.4     51.7
2        57.9     119.2    60.5     56.5
3        53.9     60.2     54.8     81.3
4        55.1     53.9     76.0     56.0
5        50.8     52.3     53.1     59.0
6        54.4     53.0     52.6     51.2
7        50.9     51.8     52.5     51.6
8        52.1     53.2     53.4     50.7
9        51.1     50.9     51.1     50.8

IntPos   StDev A  StDev C  StDev G  StDev T
1        18.2     8.3      11.4     5.0
2        8.1      18.1     10.5     6.3
3        6.8      9.2      5.8      16.6
4        5.6      6.7      12.6     8.0
5        5.4      5.0      5.7      8.8
6        5.8      5.5      5.7      4.7
7        6.1      5.8      6.4      5.9
8        5.1      5.6      5.5      6.1
9        6.1      6.1      5.9      6.2

[0080] The means and standard deviations above are labeled by the complement of the base in the chip cell. For example, the mean and standard deviation for A were determined from the intensities associated with the cell that contained probes having the base T at the interrogation position.

[0081] These means and standard deviations were utilized to produce the following probabilities according to the equations set forth above:

IntPos   Prob A   Prob C   Prob G   Prob T
1        1        0        5.2e−7   0
2        2.3e−4   1        9.2e−4   8.9e−5
3        0.01     0.077    0.013    0.9
4        0.019    0.013    0.93     0.033
5        0.056    0.11     0.16     0.68
6        0.42     0.25     0.22     0.11
7        0.18     0.26     0.33     0.24
8        0.21     0.32     0.35     0.12
9        0.26     0.24     0.26     0.23

[0082] The probabilities have been normalized so that the sum of the probabilities associated with the bases at each interrogation position equals 1.

[0083] If the bases are called according to the highest probability (also equal to a threshold of 0.25 in this case), the bases would be called as follows with the associated confidence:

IntPos   BaseCall   Confid
1        A          1
2        C          1
3        T          0.9
4        G          0.93
5        T          0.68
6        A          0.42
7        G          0.33
8        G          0.35
9        G          0.26

[0084] As the sample nucleic acid was known to be 5′-ACTGTAGGG, the sequence was correctly called by the base probability method. Importantly, the confidence values indicate the likelihood that each base call is correct.

[0085] If the bases are called with a probability threshold of 0.5, the bases would be called as follows:

IntPos   BaseCall   Confid
1        A          1
2        C          1
3        T          0.9
4        G          0.93
5        T          0.68
6        M          0.67
7        S          0.59
8        S          0.67
9        R          0.52

[0086] where the ambiguity codes M=A or C, S=C or G and R=A or G according to the IUPAC codes. As shown, all the confidence values are above 50% for each base call.

[0087] Advantages of the probability base calling method include that it is extremely accurate in calling bases of sample nucleic acid sequences and provides a confidence value of the accuracy of the base call. The method is robust and uniformly optimal regardless of experimental conditions. Additionally, the probability base calling method is capable of accurately calling bases and identifying mutations near other mutations.

[0088] III. Maximum Probability Method

[0089] The present invention provides a maximum probability method of increasing the accuracy of base calling by analyzing multiple experiments performed on a DNA or RNA molecule. The multiple experiments may be repetitions of the same experiment or may vary by the number of probes on the chip, wash (or salt) concentration, tiling method, and the like. Additionally, the multiple experiments may include experiments performed on the sense and anti-sense strands of the sample nucleic acid sequence. Although in a preferred embodiment this method is performed in conjunction with probability base calling, the method may be readily used with other base calling methods, including those disclosed in U.S. patent application Ser. No. 08/327,525.

[0090] FIG. 10 shows the flow of the maximum probability method. The method will be described as sequencing a sample nucleic acid sequence. At step 352, base calling is performed on data from multiple experiments on the sample nucleic acid sequence.

[0091] The system identifies an interrogation position in the sample nucleic acid sequence at step 354. The system then identifies the base that was called with the highest probability among the multiple experiments. In a preferred embodiment, the highest probability is determined by the probability base calling method. In other embodiments, for example, the base that had the highest associated intensity may be identified. At step 356, the system calls the base at the interrogation position as the base with the highest probability. The probability also represents the confidence that the base has been called correctly.

[0092] At step 358, the system determines if base calling should be performed on another interrogation position. If so, the system proceeds to step 354 to retrieve the next interrogation position.
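By way of illustration only, steps 352 through 358 may be sketched in Python as follows. The function names and the input layout are hypothetical, and the probabilities in the usage example are made-up values chosen only to exercise the selection. The sketch assumes that each experiment has already produced one (base, probability) pair per interrogation position and that all calls are expressed relative to the same strand.

```python
def max_probability_call(calls):
    """Pick the call with the highest probability across experiments.

    `calls` holds one (base, probability) pair per experiment, all for
    the same interrogation position.
    """
    return max(calls, key=lambda call: call[1])

def max_probability_sequence(experiments):
    """Apply the selection at every interrogation position.

    `experiments` is a list of per-experiment call lists of equal length.
    """
    return [max_probability_call(position) for position in zip(*experiments)]

# Made-up calls from two experiments over three interrogation positions.
first = [("A", 0.97), ("C", 0.55), ("T", 0.90)]
second = [("A", 0.99), ("G", 0.81), ("T", 0.62)]
print(max_probability_sequence([first, second]))
```

In this made-up example the two experiments disagree at the second position, and the call from the second experiment is retained because its probability (0.81) is the higher of the two. The retained probability serves as the confidence of the combined call.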

[0093] As an example, six known nucleic acid sequence clones of HIV DNA were labeled and exposed to the HIV418 chip available from Affymetrix, Inc., Santa Clara, Calif. The multiple experiments for each HIV clone included sequencing the sense and anti-sense strands of the HIV clone. The following shows the percentage error of probability base calling for the sense and anti-sense strands of the HIV clones, together with the percentage error of the maximum probability method (MaxProb):

HIV Clone   Sense   Anti-sense   MaxProb
4 mut 18    3.08    1.83         1.73
HXB         2.02    1.44         1.35
NY5         2.31    1.44         1.63
NY5-215     2.60    1.15         1.25
NYS-5mut    2.88    1.54         1.44
pPol19      3.17    2.98         1.83
Average     2.68    1.73         1.54

[0094] As shown, the probability base calling method had a 3.08 percent error for sequencing the bases of the 4 mut 18 sense strand and a 1.83 percent error for sequencing the bases of the 4 mut 18 anti-sense strand. However, if the maximum probability method is utilized, the error percentage drops to 1.73. More significantly, the averages in the table above show that the maximum probability method yields a 1.54 percent error, which translates to 98.46 percent correct base calling. This percentage is a significant improvement over present day chip sequencing methods.

[0095] The maximum probability method provides a significant improvement in base calling correctness by advantageously combining the results from multiple experiments. Although the method has been described as sequencing a sample nucleic acid sequence, the method may be utilized to sequence genes or call individual bases.

[0096] IV. Product of Probabilities Method

[0097] The present invention provides a product of probabilities method of increasing the accuracy of base calling by analyzing multiple experiments performed on a DNA or RNA molecule. The multiple experiments may be repetitions of the same experiment or may vary by the number of probes on the chip, wash (or salt) concentration, tiling method, and the like. Additionally, the multiple experiments may include experiments performed on the sense and anti-sense strands of the sample nucleic acid sequence. Although in a preferred embodiment this method is performed in conjunction with probability base calling, the method may be readily used with other base calling methods, including those disclosed in U.S. patent application Ser. No. 08/327,525.

[0098] FIG. 11 shows the flow of the product of probabilities method. The method will be described as sequencing a sample nucleic acid sequence. At step 402, base calling is performed on data from multiple experiments on the sample nucleic acid sequence.

[0099] The system identifies an interrogation position in the sample nucleic acid sequence at step 404. The system then multiplies the probabilities associated with each base among the experiments to produce a product at step 406. For example, the system identifies the probability that the base at the interrogation position is base A from each experiment. The system multiplies these probabilities together to produce a product of probabilities for A. The system similarly produces a product of probabilities for C, G and T. Optionally, the system then normalizes each of the products of probabilities by dividing each by the sum of the products of probabilities for A, C, G, and T. In this way, the sum of the resulting products of probabilities will equal 1.

[0100] In a preferred embodiment, the probabilities are determined by the probability base calling method. In other embodiments, for example, the base that had the highest associated intensity may be identified. At step 408, the system calls the base at the interrogation position as the base with the highest product of probabilities.

[0101] At step 410, the system determines if base calling should be performed on another interrogation position. If so, the system proceeds to step 404 to retrieve the next interrogation position.
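By way of illustration only, steps 404 through 408 may be sketched in Python as follows for a single interrogation position. The function names and the input layout are hypothetical. The first set of probabilities in the usage example is taken from interrogation position 6 of the worked example above, while the second set is made up purely for illustration.

```python
BASES = "ACGT"

def product_of_probabilities(experiments, normalize=True):
    """Multiply per-base probabilities across experiments.

    `experiments` is a list of dicts mapping each base to its
    probability in one experiment (hypothetical input layout).
    """
    products = {}
    for base in BASES:
        product = 1.0
        for probs in experiments:
            product *= probs[base]
        products[base] = product
    total = sum(products.values())
    # Optional normalization so that the four products sum to 1.
    if normalize and total > 0.0:
        products = {base: p / total for base, p in products.items()}
    return products

def call_by_product(experiments):
    """Call the base with the highest (normalized) product."""
    products = product_of_probabilities(experiments)
    best = max(products, key=products.get)
    return best, products[best]

first = {"A": 0.42, "C": 0.25, "G": 0.22, "T": 0.11}   # position 6 of the example
second = {"A": 0.60, "C": 0.20, "G": 0.15, "T": 0.05}  # made-up second experiment
print(call_by_product([first, second]))
```

In this illustration both experiments favor A only weakly, yet the normalized product for A is roughly 0.74, which shows how agreement between experiments can sharpen a call that neither experiment supports strongly on its own.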

[0102] The product of probabilities method provides a significant improvement in base calling correctness by advantageously combining the results from multiple experiments. Although the method has been described as sequencing a sample nucleic acid sequence, the method may be utilized to sequence genes or call individual bases.

[0103] V. Wild-Type Base Preference Method

[0104] The present invention provides methods of increasing the accuracy of base calling by analyzing both strands of a DNA (or complementary strands of RNA) molecule and calling the base according to the chip wild-type base. The accuracy is improved because some bases are correctly identified more often depending on the wild-type base on the chip. By analyzing both strands of the DNA molecule, the base calling method can better utilize this information to improve the accuracy of base calling. In a preferred embodiment, this method is performed in conjunction with probability base calling, but other base calling methods may be utilized.

[0105] A molecule of DNA is composed of two complementary strands of deoxyribonucleotides (bases). Before the sequence of the DNA is evaluated, the DNA molecule is cleaved into its two complementary strands. One strand is then cloned to produce enough nucleic acid sequences to be labeled and sequenced (called) according to the methods disclosed herein. For identification purposes, this strand of DNA will be called the “sense” strand.

[0106] According to the present invention, the other strand, the “anti-sense” strand, is also cloned, labeled, and sequenced. Through analysis of known nucleic acid sequences, it has been determined that when the wild-type base at the interrogation position on the chip is A or G, the resulting base call is correct a higher percentage of the time. Conversely, it has been determined that when the wild-type base at the interrogation position on the chip is C or T, the resulting base call is incorrect a higher percentage of the time. For example, when the wild-type base at the interrogation position on the chip is T, the resulting base call is incorrect (i.e., the base is miscalled) up to three times more often than for the other chip wild-type bases.

[0107] It is believed that some of the inaccuracy may be caused by the fluorescein label, which is bound to the base thymine in some embodiments. Additionally, some of the inaccuracy may be caused by the fact that both C and T are pyrimidines. Whatever the cause, this information is utilized to increase the accuracy of base calling methods.

[0108] As the sense and anti-sense nucleic acid strands are complementary, the base calling method should indicate complementary bases for the two strands. For example, if the sense strand has a base A at an interrogation position, the base calling method should indicate the base is A. However, the anti-sense strand will have a base T at a corresponding interrogation position, so the base calling method should indicate the base is T.

[0109] FIG. 12 illustrates the flow of the wild-type base preference method. At step 452, base calling is performed on the sense strand to call a base at an interrogation position in the sense strand. Base calling is performed on the anti-sense strand to call the base at the interrogation position in the sense strand at step 454. The sense and anti-sense strands may be analyzed separately or concurrently.

[0110] The system identifies an interrogation position in the sample nucleic acid sequence at step 456. At step 458, the system calls the base at the interrogation position according to the strand that has a chip wild-type base A or G at the interrogation position. Thus, if the anti-sense strand chip wild-type at the interrogation position is G, the base is called according to the anti-sense strand.

[0111] As an example, assume the base call utilizing the sense strand calls the base at the interrogation position an A. Assume also the base call utilizing the anti-sense strand calls the base at the interrogation position a C (which translates to a G for the sense strand as the sense and anti-sense strands are complementary). If the chip wild-type base for the sense strand is A (which means the chip wild-type base for the anti-sense strand is T), the system calls the base an A according to the base call that utilizes the sense strand because the chip wild-type associated with the sense strand is an A or G.

[0112] At step 460, the system determines if base calling should be performed on another interrogation position. If so, the system proceeds to step 456 to retrieve the next interrogation position.

[0113] Although the wild-type base preference method has been described as giving a higher priority to A and G as the chip wild-type, other bases may be preferred in other embodiments. Accordingly, the method is not limited to preference of any specific chip wild-type bases.
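By way of illustration only, the selection of step 458 may be sketched in Python as follows. The function name, the parameter names, and the `preferred` parameter are hypothetical. The sketch assumes unambiguous calls (A, C, G, or T), takes the anti-sense call in anti-sense terms, and returns the call in sense-strand terms. The fallback used when neither strand has a preferred wild-type base is an assumption and can only arise for preference sets other than A and G.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def wild_type_preference_call(sense_call, antisense_call, sense_wild_type,
                              preferred="AG"):
    """Choose between the sense and anti-sense base calls.

    sense_call      -- base called from the sense strand
    antisense_call  -- base called from the anti-sense strand
    sense_wild_type -- chip wild-type base for the sense strand
    preferred       -- chip wild-type bases given priority
    """
    antisense_wild_type = COMPLEMENT[sense_wild_type]
    if sense_wild_type in preferred:
        return sense_call
    if antisense_wild_type in preferred:
        # Translate the anti-sense call back to sense-strand terms.
        return COMPLEMENT[antisense_call]
    # Assumption: with no preferred wild-type on either strand,
    # fall back to the sense-strand call.
    return sense_call

# The situation of paragraph [0111]: sense call A, anti-sense call C,
# chip wild-type A for the sense strand.
print(wild_type_preference_call("A", "C", "A"))
```

Because the chip wild-type bases of the two strands are complementary, exactly one of them is A or G at any position, so with the default preference the rule always selects one strand. If the chip wild-type for the sense strand were C instead, the anti-sense call C would be used and reported as G for the sense strand.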

[0114] VI. Software Appendix

[0115] The Software Appendix (copyright Affymetrix, Inc.) provides C++ source code for implementing the present invention. The source code is written for a Sun Workstation.

[0116] The above description is illustrative and not restrictive. Many variations of the invention will become apparent to those of skill in the art upon review of this disclosure. Merely by way of example, while the invention is illustrated with particular reference to the evaluation of DNA (natural or unnatural), the methods can be used in the analysis from chips with other materials synthesized thereon, such as RNA. The scope of the invention should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims along with their full scope of equivalents.

What is claimed is:
 1. In a computer system, a method of calling an unknown base in a sample nucleic acid sequence, the method comprising the steps of: inputting a plurality of hybridization probe intensities, each of the probe intensities corresponding to a nucleic acid probe; for each of the plurality of probe intensities, determining a probability that the corresponding nucleic acid probe best hybridizes with the sample nucleic acid sequence; and calling the unknown base according to the nucleic acid probe with the highest associated probability.
 2. The method of claim 1, wherein the highest probability indicates a confidence that the unknown base is called correctly.
 3. The method of claim 1, wherein each nucleic acid probe has a different base at an interrogation position.
 4. The method of claim 3, wherein each probability indicates the likelihood that the unknown base is complementary to a base at the interrogation position in the corresponding nucleic acid probe.
 5. The method of claim 1, wherein each of the plurality of probe intensities is calculated from a plurality of intensity values.
 6. The method of claim 5, wherein each of the plurality of probe intensities is a mean of the plurality of intensity values.
 7. The method of claim 5, further comprising the step of calculating a standard deviation for each of the plurality of intensity values associated with a probe intensity.
 8. The method of claim 1, further comprising the step of comparing the highest probability to a probability threshold.
 9. The method of claim 8, wherein the unknown base is called according to the nucleic acid probe associated with the highest probability if the highest probability exceeds the probability threshold.
 10. The method of claim 8, further comprising the steps of: producing a sum of the highest probability and a next highest probability; comparing the sum to the probability threshold; and calling the unknown base according to the nucleic acid probes associated with the highest and next highest probabilities if the sum exceeds the probability threshold.
 11. The method of claim 10, wherein the sum indicates a confidence that the unknown base is called correctly if the sum exceeds the probability threshold.
 12. The method of claim 1, wherein each probability is determined by an equation: Prob(X)=Prob(I _(X)>max(I _(Y)))_(Y≠X) where X and Y are A, C, G, or T(U); I is a probe intensity associated with a subscripted base; and max represents a maximum of the probe intensities.
 13. The method of claim 12, wherein the equation includes: Prob(I _(X)>max(I _(Y)))≈ΠProb(I _(X) >I _(Y))_(Y≠X) where X and Y are A, C, G, or T(U) and Π represents a product of probabilities that I_(X) is greater than other of the probabilities.
 14. The method of claim 13, wherein the equation includes: ${\mathrm{Prob}\left( I_{X} > I_{Y} \right)} = {\Phi\left( \frac{I_{X} - I_{Y}}{\sqrt{\sigma_{X}^{2} + \sigma_{Y}^{2}}} \right)},\quad Y \neq X$

where X and Y are A, C, G, or T(U) and σ represents a standard deviation (σ² being a variance) of intensities for a subscripted base.
 15. The method of claim 14, wherein the equation includes: ${\Phi(X)} = {\int_{-\infty}^{X}{\frac{1}{\sqrt{2\pi}}e^{-\frac{y^{2}}{2}}\,dy}}$

where Φ represents a density equation of standard normal distribution.
 16. The method of claim 1, wherein the unknown base is called as being A, C, G, or T(U).
 17. A computer program that calls an unknown base in a sample nucleic acid sequence, comprising: code that receives as input a plurality of hybridization probe intensities, each of the probe intensities corresponding to a nucleic acid probe; code that determines for each of the plurality of probe intensities a probability that the corresponding nucleic acid probe best hybridizes with the sample nucleic acid sequence; and code that calls the unknown base according to the nucleic acid probe with the highest associated probability; wherein the codes are stored on a tangible medium.
 18. In a computer system, a method of calling an unknown base in a sample nucleic acid sequence, the method comprising the steps of: inputting a plurality of base calls for the unknown base, each of the base calls having an associated probability which represents a confidence that the unknown base is called correctly; selecting a base call that has a highest associated probability; and calling the unknown base according to the selected base call.
 19. The method of claim 18, further comprising the step of performing a base call of the unknown base according to hybridization of nucleic acid probes with the sample nucleic acid sequence.
 20. The method of claim 18, wherein the base calls are determined from a plurality of experiments.
 21. A computer program that calls an unknown base in a sample nucleic acid sequence, comprising: code that receives as input a plurality of base calls for the unknown base, each of the base calls having an associated probability which represents a confidence that the unknown base is called correctly; selecting a base call that has a highest associated probability; and calling the unknown base according to the selected base call; wherein the codes are stored on a tangible medium.
 22. In a computer system, a method of calling an unknown base in a sample nucleic acid sequence, the method comprising the steps of: inputting a plurality of probabilities for each possible base for the unknown base, each of the probabilities representing a probability that the unknown base is an associated base; producing a product of probabilities for each possible base, each product being associated with a possible base; and calling the unknown base according to a base associated with a highest product.
 23. The method of claim 22, further comprising the step of calculating the probabilities according to hybridization of nucleic acid probes with the sample nucleic acid sequence.
 24. The method of claim 22, wherein the probabilities are determined from a plurality of experiments.
 25. A computer program that calls an unknown base in a sample nucleic acid sequence, comprising: code that receives as input a plurality of probabilities for each possible base for the unknown base, each of the probabilities representing a probability that the unknown base is an associated base; code that produces a product of probabilities for each possible base, each product being associated with a possible base; and code that calls the unknown base according to a base associated with a highest product; wherein the codes are stored on a tangible medium.
 26. In a computer system, a method of calling an unknown base in a sample nucleic acid sequence, the method comprising the steps of: inputting a first base call for the unknown base, the first base call determined from a first nucleic acid probe that is equivalent to a portion of the sample nucleic acid sequence including the unknown base; inputting a second base call for the unknown base, the second base call determined from a second nucleic acid probe that is complementary to a portion of the sample nucleic acid sequence including the unknown base; selecting one of the first or second nucleic acid probes that has a base at an interrogation position which has a high probability of producing correct base calls; and calling the unknown base according to the selected one of the first or second nucleic acid probes.
 27. The method of claim 26, further comprising the step of calculating the first and second base calls according to hybridization of nucleic acid probes with the sample nucleic acid sequence.
 28. The method of claim 26, wherein the base that has a high probability of producing correct base calls is A or G.
 29. A computer program that calls an unknown base in a sample nucleic acid sequence, comprising: code that receives as input first and second base calls for the unknown base, the first base call determined from a first nucleic acid probe that is equivalent to a portion of the sample nucleic acid sequence including the unknown base and the second base call determined from a second nucleic acid probe that is complementary to a portion of the sample nucleic acid sequence including the unknown base; code that selects one of the first or second nucleic acid probes that has a base at an interrogation position which has a high probability of producing correct base calls; and code that calls the unknown base according to the selected one of the first or second nucleic acid probes; wherein the codes are stored on a tangible medium. 